Probability distribution for binomial process:
\[
P(k | n, \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n - k}
\]
where
\(k\) successes in \(n\) trials; and
the probability of success on any trial is \(\theta\) .
Multiple experiments: for session \(i\) there are \(k_i\) hits from \(n_i\) spins.
At the session level the relationship between spins and hits can be modelled as a binomial process. For each session we consider the number of spins (or “trials”), denoted by \(n\), and the number of hits (or “successes”), denoted by \(k\). And \(\theta\) is the probability of success on any spin.
The binomial distribution allows us to calculate the probability of \(k\) given specific values for \(n\) and \(\theta\) . So this assumes that you know \(\theta\) … But what if you don’t? Certainly for most slot machines you don’t know the probability of winning. If you did then you probably wouldn’t be very tempted to play.
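The formula above is straightforward to evaluate directly. A minimal sketch in Python, with illustrative values (3 hits in 10 spins, \(\theta = 0.3\)):

```python
from math import comb

def binomial_pmf(k, n, theta):
    """P(k | n, theta) for a binomial process."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# e.g. the probability of exactly 3 hits in 10 spins when theta = 0.3
p = binomial_pmf(3, 10, 0.3)  # ≈ 0.2668
```

Summing the probability over all possible values of \(k\) from 0 to \(n\) gives 1, as it must for a distribution.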
Suppose we had observations of \(k\) and \(n\) , and wanted to estimate \(\theta\) .
Probability distribution for Bernoulli process:
\[
P(k | \theta) = \theta^k (1 - \theta)^{1 - k}
\]
where
\(k = 0\) indicates failure;
\(k = 1\) indicates success; and
the probability of success on any trial is \(\theta\) .
A Bernoulli process is the atomic component of a binomial process and it considers a single trial at a time.
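The correspondence is direct in code: multiplying the Bernoulli probabilities for a sequence of outcomes gives \(\theta^k (1-\theta)^{n-k}\), i.e. the binomial probability without the combinatorial factor (which only matters when the order of outcomes is ignored). A Python sketch with made-up outcomes:

```python
from math import prod

def bernoulli_pmf(k, theta):
    """P(k | theta) for a single trial, k in {0, 1}."""
    return theta**k * (1 - theta)**(1 - k)

theta = 0.3
outcomes = [1, 0, 0, 1, 0]  # hypothetical spins: 2 hits in 5 trials
p_sequence = prod(bernoulli_pmf(k, theta) for k in outcomes)
# equals theta**2 * (1 - theta)**3
```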
Bayes’ Theorem
\[
p(\theta|y, X) = \frac{p(y|X, \theta) \, p(\theta)}{p(y)} \propto p(y|X, \theta) \, p(\theta)
\] where
\(y\) are observations;
\(X\) are predictors;
prior — \(p(\theta)\) ;
likelihood — \(p(y|X, \theta)\) ; and
posterior — \(p(\theta|y, X)\) .
Our plan was to use Bayesian techniques to extract the parameters for these distributions.
Maud is quite familiar with these techniques but I had to remind myself of how it all works.
Bayes Theorem relates three important components: a prior, likelihood and posterior.
The prior specifies what we know about the model parameters before considering the data.
The likelihood gives the probability of the data conditional on the model and choice of parameter.
The posterior is what we know about the model parameters after factoring in the data.
A powerful feature is that this can be applied iteratively, with the posterior from one iteration becoming the prior in the following iteration.
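For the Bernoulli/binomial likelihood this iteration even has a closed form: starting from a Beta(a, b) prior, observing \(k\) successes in \(n\) trials gives a Beta(a + k, b + n − k) posterior, which then serves as the prior for the next batch. A Python sketch with hypothetical session counts:

```python
# Sequential Bayesian updating with a conjugate Beta prior on theta.
a, b = 1.0, 1.0                         # Beta(1, 1): the uniform prior
sessions = [(5, 20), (7, 25), (6, 18)]  # hypothetical (hits, spins) per session

for k, n in sessions:
    a, b = a + k, b + (n - k)           # posterior becomes the next prior

posterior_mean = a / (a + b)            # pooled estimate of theta
```

Processing the sessions one at a time gives exactly the same posterior as pooling all of the counts up front.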
Despite the apparent simplicity of this relationship it has historically been rather difficult to apply due to computational challenges.
Note:
The denominator is sometimes called the “Evidence”. It’s just the marginal distribution of \(y\) .
So much for theory. Analytical expressions are rare in practice.
Confounding features:
data are often multi-dimensional;
models have multiple parameters.
So evaluating \(p(\theta|y, X)\) becomes challenging!
One approach to applying Bayes’ Theorem is to use a regular grid of parameter values. This works well and you can see that the shape of the posterior (as represented by points on the grid) evolves with each iteration.
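A grid approximation takes only a few lines of code. This Python sketch (hypothetical counts: 31 hits in 100 spins, uniform prior) evaluates the unnormalised posterior on a regular grid and then normalises it:

```python
# Grid approximation of the posterior for a binomial theta.
k, n = 31, 100                               # hypothetical hits and spins
grid = [i / 200 for i in range(1, 200)]      # regular grid over (0, 1)
prior = [1.0 for _ in grid]                  # uniform prior
likelihood = [t**k * (1 - t)**(n - k) for t in grid]

unnormalised = [l * p for l, p in zip(likelihood, prior)]
total = sum(unnormalised)
posterior = [u / total for u in unnormalised]  # sums to 1 over the grid

theta_map = grid[posterior.index(max(posterior))]  # grid point of highest density
```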
But there’s a major problem with the grid approximation: it scales poorly with the number of model parameters and rapidly becomes intractable.
Stan:
RStan:
We’re not going to stress about the details of Markov chains because we’ll be using a remarkable piece of software called Stan.
Stan is a high-level language for writing statistical models. There's a package for it in R. Both are being aggressively developed.
Note:
It uses Hamiltonian Monte Carlo, which applies some general principles from Physics to provide much more efficient sampling. Yes, it takes a little longer to generate each sample, but those samples explore parameter space much more effectively.
Hamiltonian Monte Carlo is much more efficient than vanilla MCMC or the Metropolis algorithm.
Stan Workflow
Choose a model.
Write Stan program (likelihood and priors).
Stan parser converts this to C++.
Compile C++.
Execute compiled binary.
Evaluate results. Optionally return to 2.
Inference based on posterior sample.
To use rstan you need a .stan file and a .R file.
Stan Skeleton
We’ll start by building a model to address Maud’s first burning question: what is the hit rate?
This is the basic structure of a Stan file. It consists of a number of blocks, not all of which are required.
The most important blocks are:
data which specifies the observations;
parameters which lists the parameters of the model; and
model which defines the likelihood and priors.
data {
int<lower = 0> N;
int<lower = 0, upper = 1> y[N];
}
parameters {
real<lower = 0, upper = 1> theta;
}
model {
y ~ bernoulli(theta);
}
Our first Stan model has two elements of data: the number of samples and the outcomes of the individual spins, encoded as a binary vector.
We know that theta must lie between 0 and 1.
We specify a Bernoulli likelihood.
We haven’t given a prior on theta, so by default Stan will use a uniform prior: any value of theta is equally likely.
library(rstan)

# Use all available cores.
#
options(mc.cores = parallel::detectCores())

trials <- list(
  N = nrow(spin),
  y = spin$success
)

fit <- stan(
  file = "bernoulli.stan",
  data = trials,
  chains = 2,    # Number of Markov chains
  warmup = 500,  # Warmup iterations per chain
  iter = 1000    # Total iterations per chain
)
This is the corresponding R code. Things to note:
it loads the rstan library;
data are transferred to Stan via a list (with names corresponding to those in the Stan file);
Stan is invoked via the stan() function; and
it specifies the number of chains as well as the number of warmup and total iterations per chain.
The “chains” are initially a little mysterious, but these are basically just independent series of samples. They aid better sampling of parameter space and also facilitate parallelism.
The warmup is a number of iterations during which the distribution should, in principle, achieve an equilibrium.
Note that there are 1000 iterations in total per chain, of which 500 are warmup, leaving 500 active iterations per chain. So across the 2 chains there are 1000 (2 times 500) active iterations in total.
Note:
Set the number of cores to be used for parallel execution (otherwise chains are run in series).
A traceplot shows the sampled values as a function of iteration number.
Let’s take a moment to see where those samples came from. We specified 2 chains, each with 500 warmup iterations and 1000 iterations in total, leaving 500 active iterations per chain. So that accounts for the 1000 samples in the stanfit object.
The chains are uncorrelated, but successive samples within each chain are clearly related to each other. Ideally we are aiming for good “mixing” across the domain of the distribution.
What’s happening during the warmup phase? The parameters of the MC are getting tuned to optimally explore the parameter space (one aspect of this is choosing the step size). By the time that the active iterations start the sampler should have more or less reached an equilibrium.
[1] "stanfit"
attr(,"package")
[1] "rstan"
samples <- as.data.frame(fit)
head(samples)
theta lp__
1 0.3095256 -1226.968
2 0.3072672 -1227.066
3 0.3124851 -1226.912
4 0.3200480 -1227.132
5 0.3140362 -1226.914
6 0.2929604 -1228.812
[1] 1000
We don’t actually get the posterior directly. The stanfit object is really just a collection of samples from the posterior.
Note:
We can access the samples using the extract() function, which yields a list. It’s normally easier to just convert these samples to a data frame.
The number of samples that we get is precisely the total number of active iterations across all chains.
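Everything that print(fit) reports can be computed from the draws directly. A Python sketch using fake draws in place of the stanfit samples (assuming roughly the fitted values: mean 0.31, sd 0.012):

```python
import random
import statistics

random.seed(42)
draws = [random.gauss(0.31, 0.012) for _ in range(1000)]  # stand-in for the theta samples

mean = statistics.fmean(draws)
cuts = statistics.quantiles(draws, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
lower, upper = cuts[0], cuts[-1]          # the 2.5% and 97.5% quantiles
```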
The samples for success probability look like they have a Normal distribution and are centred on the value that we expected from earlier analysis, around 30%.
print(fit, probs = c(0.025, 0.5, 0.975))
Inference for Stan model: bernoulli.
2 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=1000.
mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
theta 0.31 0.00 0.01 0.29 0.31 0.33 413 1
lp__ -1227.43 0.04 0.70 -1229.56 -1227.14 -1226.91 348 1
Samples were drawn using NUTS(diag_e) at Tue May 15 17:45:07 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Use summary() to get information for each chain.
This is very neatly presented in the stanfit summary information.
There's the mean value of theta as well as the lower and upper quantiles of the distribution. Don't think of this as a confidence interval: these are quantiles of the posterior density. Close to the median they are stable, but further into the tails they can be noisy.
The average hit rate is in good agreement with our earlier result.
Maud is immediately more comfortable with this result because, even with a uniform prior, the success rate is constrained to a reasonable domain between 0 and 1.
n_eff is the effective sample size. This takes into account that the samples are not independent (this is a consequence of correlation in the Markov Chain). This is a vital component of the Monte Carlo version of the Central Limit Theorem.
We can see the degree of correlation in a simple autocorrelation plot.
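Both the autocorrelation and the n_eff deflation can be illustrated with a toy correlated chain. A Python sketch using an AR(1) series (lag-1 correlation 0.6) in place of real MCMC output, with the standard estimate n_eff ≈ N / (1 + 2 Σ ρ_l):

```python
import random

random.seed(1)
rho, N = 0.6, 5000
chain = [0.0]
for _ in range(N - 1):
    chain.append(rho * chain[-1] + random.gauss(0, 1))  # AR(1) toy "chain"

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x) / n
    cov = sum((x[i] - m) * (x[i + lag] - m) for i in range(n - lag)) / n
    return cov / var

r1 = autocorr(chain, 1)  # close to 0.6
n_eff = N / (1 + 2 * sum(autocorr(chain, l) for l in range(1, 50)))
```

Because successive samples are positively correlated, n_eff comes out well below the raw sample count N.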
data {
int<lower=0> N;
int hits[N];
int spins[N];
}
parameters {
real<lower=0,upper=1> theta;
}
model {
hits ~ binomial(spins, theta); // Likelihood
theta ~ beta(2, 2); // Prior
}
Let’s take a look at the session data. We need to treat these data as coming from a binomial distribution.
The data that we are providing consists of two vectors, hits and spins.
We are assuming a binomial relationship between hits and spins with a success probability denoted by the parameter theta.
As an experienced gambler Maud knows that the hit rate is not uniformly distributed between 0 and 1. We could factor this knowledge into the computation by applying a more informed prior: for example, a broad beta distribution peaked at 0.5. The results are essentially unchanged though.
Note:
The ~ represents a stochastic relationship. An = would indicate a deterministic relationship.
The beta distribution is the conjugate prior for the binomial likelihood (this means that the corresponding posterior will also be a beta distribution).
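Conjugacy also means the Stan result can be sanity-checked analytically. With hypothetical totals of 310 hits in 1000 spins, the Beta(2, 2) prior combined with a binomial likelihood gives a Beta(2 + k, 2 + n − k) posterior:

```python
k, n = 310, 1000                 # hypothetical totals of hits and spins
a_post, b_post = 2 + k, 2 + (n - k)

posterior_mean = a_post / (a_post + b_post)  # ≈ 0.31, in line with the Stan fit
```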
print(fit, probs = c(0.025, 0.5, 0.975))
Inference for Stan model: binomial-beta-prior.
2 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=1000.
mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
theta 0.31 0.00 0.01 0.29 0.31 0.33 280 1
lp__ -1228.96 0.03 0.75 -1231.15 -1228.66 -1228.45 545 1
Samples were drawn using NUTS(diag_e) at Tue May 15 17:46:00 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
The summary is essentially identical to the uniform-prior result: the mean of theta is again 0.31 with a tight 95% interval, so with this much data the beta prior has little influence on the posterior. As before, n_eff is smaller than the total number of draws because successive samples are correlated, and we can see the degree of correlation in a simple autocorrelation plot.
The nature of MCMC is that successive samples of the parameters are correlated. As a result the samples cannot be considered as independent and so we need to adjust the sample count accordingly.
Let’s move onto Maud’s second burning question, quantifying the RTP.
In an ideal world RTP would be close to 1. However, this is not the case. A lower RTP obviously favours the casino.
Looking at the distribution of RTP values we can see that there is appreciable variation between sessions. We’d like to quantify the mean RTP and get an idea of the associated uncertainties.
Now this distribution is definitely not Normal. The long tail is a dead giveaway. Maybe it could be described by a log-Normal though? This makes sense too based on the multiplicative nature of slot machine returns.
data {
int<lower=0> N;
real rtp[N];
}
parameters {
real<lower = 0> mu;
real<lower = 0> sigma;
}
model {
rtp ~ lognormal(log(mu) - sigma * sigma / 2, sigma);
}
generated quantities {
real simulated[25];
for (i in 1:25) {
simulated[i] = lognormal_rng(log(mu) - sigma * sigma / 2, sigma);
}
}
Now our Stan code specifies a log-Normal for the likelihood, with two parameters, mu and sigma.
We’re not going to specify a prior for either of the model parameters, but simply constrain them to being positive real numbers.
Note:
The mean of lognormal() is exp(mu + sigma * sigma / 2).
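The reparameterisation in the model exploits this: lognormal(log(mu) − sigma²/2, sigma) has mean mu, so mu is directly interpretable as the mean RTP. A quick Monte Carlo check in Python, using the fitted point estimates (mu ≈ 0.8, sigma ≈ 0.72) as illustrative values:

```python
import math
import random

random.seed(7)
m, s = 0.8, 0.72
mu_log = math.log(m) - s * s / 2  # location parameter chosen so the mean is m

draws = [random.lognormvariate(mu_log, s) for _ in range(200_000)]
sample_mean = sum(draws) / len(draws)  # should be close to m
```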
It’s good practice to do a posterior predictive check. We do this by generating simulated samples from the posterior distribution. These samples are specified in the generated quantities block. It looks like there is pretty good agreement between the data and the model.
These simulated values take into account the uncertainties in the model parameters too.
Inference for Stan model: lognormal-rtp.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
mu 0.80 0.00 0.07 0.69 0.76 0.80 0.84 0.94 1901 1
sigma 0.72 0.00 0.05 0.62 0.68 0.71 0.75 0.83 1852 1
lp__ -16.12 0.02 1.01 -18.77 -16.48 -15.81 -15.41 -15.16 1759 1
Samples were drawn using NUTS(diag_e) at Tue May 15 17:46:53 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
The average RTP is around 80%.
Here’s the Kicker
What’s the probability of breaking even?
[1] 0.25117
What’s the probability of doubling your money?
[1] 0.05319
Since we have access to the posterior distribution we are able to calculate all manner of statistics without relying on (poor) asymptotic approximations. For example, we can find the probability of breaking even or doubling our money.
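With point estimates for mu and sigma, similar tail probabilities also follow from the lognormal CDF. A Python sketch using the posterior means from the fit (the sample-based figures above also propagate parameter uncertainty, so they differ slightly):

```python
import math
from statistics import NormalDist

m, s = 0.80, 0.72                 # posterior means from the fit
mu_log = math.log(m) - s * s / 2

def p_exceed(x):
    """P(RTP > x) under a lognormal with mean m and log-scale sd s."""
    return 1 - NormalDist().cdf((math.log(x) - mu_log) / s)

p_break_even = p_exceed(1.0)  # ≈ 0.25
p_double = p_exceed(2.0)      # ≈ 0.05
```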
These figures are enough to make Maud reconsider her gambling fixation.
Q3: Hit Rate per Combination
description    symbols    payout
no win                    0
1x mouse       🐭          1
1x cat         🐱          2
2x mouse       🐭🐭         5
2x cat         🐱🐱         10
3x mouse       🐭🐭🐭        20
3x cat         🐱🐱🐱        100
data {
int<lower=0> N;
real rtp[N];
}
parameters {
real<lower=0, upper=1> theta[6];
real<lower=0> sigma;
}
transformed parameters {
real<lower=0> mu;
mu = theta[1] * 1 + // Payline 1 -> 1x
theta[2] * 2 + // Payline 2 -> 2x
theta[3] * 5 + // Payline 3 -> 5x
theta[4] * 10 + // Payline 4 -> 10x
theta[5] * 20 + // Payline 5 -> 20x
theta[6] * 100; // Payline 6 -> 100x
}
model {
rtp ~ lognormal(log(mu) - sigma * sigma / 2, sigma);
theta[1] ~ beta(2, 2); // Mode = 0.5
theta[2] ~ beta(2, 4); // Mode = 0.25
theta[3] ~ beta(2, 5); // Mode = 0.2
theta[4] ~ beta(2, 8); // Mode = 0.125
theta[5] ~ beta(2, 10); // Mode = 0.1
theta[6] ~ beta(2, 16); // Mode = 0.0625
}
Maud got the idea for this while pruning her roses. The idea is to use the observations of RTP per session to infer the relative frequency of the winning combinations. Since we know the payout for each combination it seems like this might be possible, but I was not convinced that we had enough information.
We've defined a vector of rate parameters, theta, and another parameter, mu, which is derived from them. The relationship appears in the transformed parameters block, where the six payouts are used as weights.
The priors are based on Maud’s gut feel for the relative frequency of each of the payouts. They carry a little information about the probabilities of each payline, but not a lot.
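The modes quoted in the comments above follow from the mode of a Beta(a, b) distribution, (a − 1)/(a + b − 2) for a, b > 1:

```python
def beta_mode(a, b):
    """Mode of a Beta(a, b) distribution, valid for a, b > 1."""
    return (a - 1) / (a + b - 2)

modes = [beta_mode(2, b) for b in (2, 4, 5, 8, 10, 16)]
# [0.5, 0.25, 0.2, 0.125, 0.1, 0.0625]
```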
The traceplot paints an interesting picture. The importance of the warmup iterations is readily apparent here: the first few samples for some of the parameters are very far from the core of their distribution. This could be improved by providing more restrictive priors.
mean se_mean sd 2.5% 97.5% n_eff Rhat
theta[1] 0.140072477 1.445106e-03 0.0886783678 0.0186286589 0.352326667 3765.615 0.9993099
theta[2] 0.068064522 6.711297e-04 0.0424459690 0.0102927430 0.168731470 4000.000 1.0001725
theta[3] 0.028892776 2.894141e-04 0.0183041579 0.0033953293 0.072914250 4000.000 0.9998855
theta[4] 0.014594931 1.437237e-04 0.0090898869 0.0019018620 0.036523596 4000.000 0.9999596
theta[5] 0.007441395 7.525103e-05 0.0047592927 0.0009537422 0.019112895 4000.000 1.0000193
theta[6] 0.001517744 1.482361e-05 0.0009375275 0.0002257116 0.003729426 4000.000 1.0005894
The 1x payout is triggered on average every 7 spins.
The 100x payout is triggered on average only every 659 spins.
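Those waiting times are just the reciprocals of the posterior mean rates from the summary above:

```python
# Mean number of spins between payouts is 1 / theta for each combination.
theta_mean = [0.140072, 0.068065, 0.028893, 0.014595, 0.007441, 0.001518]
spins_between = [round(1 / t) for t in theta_mean]
# The 1x payout lands roughly every 7 spins; the 100x only every ~659 spins.
```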
The results are interesting. We've derived values for the frequency of each of the winning combinations. Granted, there is appreciable uncertainty in those values, particularly for the rarer combinations, but the central estimates are informative.
The small payouts are relatively frequent (designed to keep you playing the game) but the big payouts are rather rare.
These are the vital statistics for determining the characteristics of the game. So we have effectively reverse engineered it.